Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize LIKE with custom escape char #7730

Conversation

xumingming
Copy link
Contributor

@xumingming xumingming commented Nov 26, 2023

Currently we optimize LIKE operation only if escape char is not specified,
this PR adds the ability to apply the optimization even if user specifies
escape char. We introduced a PatternStringIterator which handles escaping
transparently, so existing optimizations(kPrefix, kSuffix, kSubstring etc)
now work for patterns with escape char transparently, and future
optimizations will have effect for escaped pattern transparently too.

The benchmark result before this optimization:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   4.48s   223.04m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                     1.26s   791.19m
like_prefix##starts_with                                  124.82ms      8.01
like_substring##like_substring                               4.23s   236.49m
like_substring##strpos                                    247.50ms      4.04
like_suffix##like_suffix                                     3.23s   309.58m
like_suffix##ends_with                                    127.47ms      7.85

After:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   3.80s   263.33m
----------------------------------------------------------------------------
----------------------------------------------------------------------------
like_prefix##like_prefix                                  123.08ms      8.13
like_prefix##starts_with                                  123.66ms      8.09
like_substring##like_substring                            191.93ms      5.21
like_substring##strpos                                    237.96ms      4.20
like_suffix##like_suffix                                  127.67ms      7.83
like_suffix##ends_with                                    125.07ms      8.00

In Summary:

  • Speedup of kSubstring is about 20x.
  • Speedup of kPrefix is about 10x.
  • Speedup of kSuffix is about 20x.

We can confirm the speedup is reasonable from the comparison between our
optmizations and the simple scalar function strpos, starts_with, ends_with, the
performance numbers are quite close(see the like##strpos/starts_with/ends_with
in the benchmark result for more details).

Copy link

netlify bot commented Nov 26, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 0f43783
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6579c8d919155d00081e0962

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 26, 2023
@xumingming xumingming force-pushed the optimize_like_when_user_specified_escape_char branch from c4f1f23 to 0bb581d Compare November 26, 2023 12:49
@xumingming
Copy link
Contributor Author

@mbasmanova Do you have time review this one?

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xumingming Thank you for working on this. I wonder if you'd be interested in helping with #7685 as well. Also, curious how much faster optimized version is? Would you add a benchmark or extend an existing one? See velox/benchmarks/ExpressionBenchmarkBuilder.h

@xumingming
Copy link
Contributor Author

xumingming commented Nov 30, 2023

@mbasmanova Seems I have some misunderstanding here, I thought this PR would fix #7685 , for pattern %xxx_xxx_xxx%, I was thinking _ here is a literal _ rather than a single char wildcard, so both _ and % here are wildcard here right?

But anyway, support LIKE optimization when the escape char is specified is valuable, right?

@mbasmanova
Copy link
Contributor

@xumingming James, we are seeing users write queries like this: SELECT * FROM t WHERE a LIKE b || '%' .

Here, LIKE pattern appears as a prefix pattern, but when 'b' contains underscores the pattern is actually not a simple prefix patter. The user intent is of course for a prefix pattern, but we cannot assume that. The underscores happen often when a is a URL. Hence, it would be nice to optimize for prefix, suffix and substring patterns that contain underscores (wildchar).

LIKE optimization when the escape char is specified is valuable, right?

Theoretically, yes, but I haven't seen these in practice.

@xumingming
Copy link
Contributor Author

xumingming commented Dec 4, 2023

Added a benchmark to test the like optimization when escape char is specified, before this optimization:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like##like_substring                                       22.59ms     44.26
like##like_prefix                                          14.08ms     71.00
like##like_suffix                                          20.18ms     49.56
like##like_generic                                         21.37ms     46.80
----------------------------------------------------------------------------
----------------------------------------------------------------------------

after:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like##like_substring                                      586.50us     1.71K
like##like_prefix                                         361.83us     2.76K
like##like_suffix                                         361.83us     2.76K
like##like_generic                                         21.40ms     46.72
----------------------------------------------------------------------------
----------------------------------------------------------------------------

@xumingming
Copy link
Contributor Author

Hi @laithsakka @mbasmanova , can you help review this one? I will implement #7685 base on this PR, should be relatively easy.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions.

std::string fixedPattern = "";
// Contains the unescaped fixed pattern in patterns of kind kFixed, kPrefix,
// kSuffix and kSubstring.
std::string unescapedPattern = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could keep the original name 'fixedPattern'. The new name causes lots of changes that make it hard to review this PR and also is confusing to me as I need to think what does 'unescaped' mean.

vector_size_t start,
vector_size_t end,
std::optional<char> escapeChar);
std::string unescape(StringView pattern, std::optional<char> escapeChar);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add empty line between functions


/// Return the unescaped string for the specified string range, if escape char
/// is not specified just return the corresponding substring.
std::string unescape(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this function to .cpp file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it here to facilitate testing, will move it.


/// An Iterator that provides methods(hasNext, next) to iterate through a
/// pattern string.
class PatternStringIterator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this class to .cpp file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it here to facilitate testing, will move it.

}

private:
StringView pattern_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const here and for escapeChar_

return true;
};

while (iterator.hasNext()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following this logic. Would you explain what it does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function determinePatternKind has two parts:

  1. The first part collect stats about the pattern: e.g. where the first wildcard occurs, where the first fixedPattern occurs etc.
  2. The second part deduce the optimized pattern based on the stats collected in the first part.

The change I made is mainly to the first part, I introduced a dedicated struct: PatternStringIterator, the reason is : in the previous code, we advance the cursor by simply i++, now to support escape char we have some state to update when advancing the cursor:

  /// Advance the cursor to next char.
  void next() {
    currentIndex_++;
    previousCharKind_ = charKind_;

    auto currentChar = pattern_.data()[currentIndex_];
    if (previousCharKind_ != CharKind::kEscape && currentChar == escapeChar_) {
      charKind_ = CharKind::kEscape;
    } else if (
        previousCharKind_ != CharKind::kEscape && currentChar != escapeChar_ &&
        (currentChar == '_' || currentChar == '%')) {
      charKind_ = CharKind::kWildcard;
    } else {
      charKind_ = CharKind::kNormal;
    }
  }

Introducing PatternStringIterator will make the logic easier to read & maintain.

auto unescapedPattern = unescape(pattern, 0, patternLength, escapeChar);
return PatternMetadata{
PatternKind::kFixed,
(vector_size_t)unescapedPattern.size(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector_size_t should be used for row numbers and vector sizes, but not for size of a string. Why is it used here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently PatternMetadata::length is of type vector_size_t , will change it to size_t.

return PatternMetadata{
PatternKind::kSuffix, patternLength - fixedPatternStart};
PatternKind::kSuffix,
(vector_size_t)unescapedPattern.size(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to pass both unescapedPattern and its size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, when we do not support escape char, we don't need to pass pattern to PatternMetadata, because we can deduce the suffix pattern from the original pattern and size, take %ab as an example:

pattern = %ab, size = 2   -> suffix_pattern = ab

now we support escape char, with only size, we can not deduce the actual suffix_pattern, take %a\_b as example(\ is the escape char), since the unescaped pattern is a_b, the size is 3, but if we get the last 3 letters from the original pattern, it is \_b, which is wrong, so we need to pass both unescaped pattern and size.

// Prepare data which contains/prefix with/suffix with the string 'a_b_c'
for (int i = 0; i < vectorSize; i++) {
substringInput->set(
i, StringView::makeInline(fmt::format("{}a_b_c{}", i, i)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark shows 40x speed up. Is this right? I noticed you use only inline strings. Will results change at all if strings are longer?

.addExpression("like_prefix", R"(like (col1, 'a\_b\_c%', '\'))")
.addExpression("like_suffix", R"(like (col2, '%a\_b\_c', '\'))")
.addExpression("like_generic", R"(like (col3, '%a%b%c'))")
.withIterations(100)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?

@xumingming
Copy link
Contributor Author

Addressed all review comments, updated the benchmark to use longer strings, will paste the new benchmark result later.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some follow-up questions.

@@ -483,7 +483,7 @@ class OptimizedLike final : public VectorFunction {
}

private:
StringView pattern_;
std::string pattern_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

@@ -427,12 +427,12 @@ bool matchSubstringPattern(
template <PatternKind P>
class OptimizedLike final : public VectorFunction {
public:
OptimizedLike(StringView pattern, vector_size_t reducedPatternLength)
OptimizedLike(std::string pattern, vector_size_t reducedPatternLength)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use std::move to avoid extra copying

pattern_{std::move(pattern)},

}
PatternMetadata patternMetadata =
determinePatternKind(pattern, escapeChar);
vector_size_t reducedLength = patternMetadata.length;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto

PatternMetadata patternMetadata =
determinePatternKind(pattern, escapeChar);
vector_size_t reducedLength = patternMetadata.length;
auto fixedPattern = patternMetadata.fixedPattern;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto& to avoid extra copy


std::ostringstream os;
// Append the specified range to the stream.
auto appendNormalCharsFn = [&](const char* start, const char* end) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appendNormalCharsFn -> appendChars (no need for "normal" and "fn")

os << current;
} else {
// Escape char not found, append all the non-escape chars.
appendNormalCharsFn(previous, cursor);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps, use endCursor add 'break' here for readability

      appendChars(previous, endCursor);
      break;

if (iterator.previousState() == PatternStringIterator::CharKind::kEscape) {
// The char follows escapeChar can only be one of (%, _, escapeChar).
if (current != '%' && current != '_' && current != escapeChar) {
fallbackToGeneric = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we reach here, does it mean that the pattern is not valid? If so, should we fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but seems we can not throw exception here, exception need to be set to the context? which we don't have access to here, so here we will fallback to generic and let generic implementation handle the exception.

auto error = std::make_exception_ptr(std::invalid_argument(
    "Escape character must be followed by '%', '_' or the escape character itself"));
context.setErrors(rows, error);

iterator.next();
auto current = iterator.current();

if (iterator.previousState() == PatternStringIterator::CharKind::kEscape) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iterator abstraction is supposed to hide the details of how escape characters are processed. I'd expect it to allow us to write code as if we don't know anything about escape characters. However, I see that we are handling escape chars both in the iterator and in the code that uses the iterator. Is this intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my initial design, PatternStringIterator does not encapsulate the escape state, but I do agree with you that let it encapsulate escape state will make its user much easier.

@xumingming
Copy link
Contributor Author

before:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like##like_substring                                       41.03ms     24.37
like##like_prefix                                          21.04ms     47.53
like##like_suffix                                          40.94ms     24.43
like##like_generic                                         40.86ms     24.47
----------------------------------------------------------------------------
----------------------------------------------------------------------------

after:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like##like_substring                                       78.73us    12.70K
like##like_prefix                                          42.70us    23.42K
like##like_suffix                                          42.76us    23.39K
like##like_generic                                         39.44ms     25.35
----------------------------------------------------------------------------
----------------------------------------------------------------------------

@xumingming
Copy link
Contributor Author

Hi @mbasmanova , can you help review again? comments addressed.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xumingming Thank you for iterating on this PR. I'm still a bit confused about the code and have some suggestions to refactor for clarity.

std::ostringstream os;
// Append the specified range to the stream.
auto appendChars = [&](const char* start, const char* end) {
for (auto it = start; it < end; it++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not os.write(start, end - start)?

https://cplusplus.com/reference/ostream/ostream/write/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

CharKind userPreviousCharKind_ = CharKind::kNormal;

/// Advance the cursor to next char.
void nextInternal() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move methods before member variables

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// previousCharKind_ is kEscape, while userPreviousCharKind_ is kNormal.
CharKind userPreviousCharKind_ = CharKind::kNormal;

/// Advance the cursor to next char.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use // for comments in .cpp files

use /// for comments on public methods in .h files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

size_t singleCharacterWildcardCount = 0;

PatternStringIterator iterator{pattern, escapeChar};
bool fallbackToGeneric = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of falling back to 'generic', we should return exec::AlwaysFailingVectorFunction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current function determinePatternKind returns PatternKind, it has no knowledge about AlwaysFailingVectorFunction?

velox/functions/lib/Re2Functions.cpp Show resolved Hide resolved
// Kind of current char.
CharKind charKind_ = CharKind::kNormal;
// Kind of previous char(literally).
CharKind previousCharKind_ = CharKind::kNormal;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link seems broken, we need previousCharKind_ to track escape state?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... Can you try #7933 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova Applied your change.

// The difference between previousCharKind_ and realPreviousCharKind_ can be
// described by an example: for pattern string 'a\_b', if the cursor is at '_'
// previousCharKind_ is kEscape, while userPreviousCharKind_ is kNormal.
CharKind userPreviousCharKind_ = CharKind::kNormal;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we can simplify some more. We can replace this with a bool isPreviousWildcard_; remove nextInternal; and remove CharKind::kEscape.

See last commit in #7933


benchmarkBuilder
.addBenchmarkSet(
"like",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename the file to LikeBenchmark.cpp so we can add benchmarks for like without escape char in a follow-up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case you didn't notice that there is another LikeFunctionsBenchmark.cpp, but I find it not easy to change it to add like + escape benchmark into it, so I started this new benchmark.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the pointer. I didn't realize that. Looks like that other benchmark is targeting LIKE patterns found in TPC-H benchmark. Maybe we could rename it to LikeTpchBenchmark and use LikeBenchmark name for the new more generic benchmark you have started. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, renamed.

return StringView(temp);
},
nullptr);
auto suffixInput = vectorMaker.flatVector<facebook::velox::StringView>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use flatVectorstd::string

@xumingming
Copy link
Contributor Author

xumingming commented Dec 8, 2023

I’m not sure if there is something wrong with the design of the benchmark, the improvement seems huge.

Before(goes Generic):

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like##substring                                            40.99ms     24.40
like##prefix                                               20.97ms     47.69
like##suffix                                               40.52ms     24.68
like##generic                                              40.41ms     24.74
----------------------------------------------------------------------------
----------------------------------------------------------------------------

After(goes corresponding optimization):

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like##substring                                            83.71us    11.95K
like##prefix                                               44.34us    22.55K
like##suffix                                               42.37us    23.60K
like##generic                                              36.70ms     27.25
----------------------------------------------------------------------------
----------------------------------------------------------------------------

return std::string("x", row);
}
},
nullptr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to pass nullptr as it is the default value

.addExpression("prefix", R"(like (col1, 'a\_b\_c%', '\'))")
.addExpression("suffix", R"(like (col2, '%a\_b\_c', '\'))")
.addExpression("generic", R"(like (col0, '%a%b%c'))")
.withIterations(10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 10?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually no reason, this statement can be removed to use the default value: 1000.

@xumingming
Copy link
Contributor Author

@mbasmanova Do you have further comments?

@mbasmanova
Copy link
Contributor

@xumingming James, have you been able to resolve the below concern?

I’am not sure if there is something wrong with the design of the benchmark, the improvement seems huge.

@mbasmanova mbasmanova changed the title Optimize LIKE when user specified escape char Optimize LIKE with custom escape char Dec 12, 2023
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xumingming Looks great % a few nits. Please, update PR description to explain all the changes in this PR and benchmark results.


auto substringInput =
vectorMaker.flatVector<std::string>(vectorSize, [&](auto row) {
// Only when the number is even we make a string contains a substring
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: // Strings in even rows contain a_b_c.


auto prefixInput =
vectorMaker.flatVector<std::string>(vectorSize, [&](auto row) {
// Only when the number is even we make a string starts with a_b_c.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: // Strings in even rows start with a_b_c


auto suffixInput =
vectorMaker.flatVector<std::string>(vectorSize, [&](auto row) {
// Only when the number is even we make a string ends with a_b_c.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: // Strings in even rows end with a_b_c.

const vector_size_t vectorSize = 1000;
auto vectorMaker = benchmarkBuilder.vectorMaker();

auto substringInput =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a good amount of repetition here. Consider extracting helper function:

auto substringInput = makeInput(vectorMaker, vectorSize, [] (auto row) { return ...a_b_c...;});

auto prefixInput = makeInput(vectorMaker, vectorSize, [] (auto row) { return a_b_c...;});

auto suffixInput = makeInput(vectorMaker, vectorSize, [] (auto row) { return ...a_b_c;});

or use a pair of booleans to indicate whether to pad strings at the beginning and at end.

auto substringInput = makeInput(vectorMaker, vectorSize, true, true);

auto prefixInput = makeInput(vectorMaker, vectorSize, false, true);

auto suffixInput = makeInput(vectorMaker, vectorSize, true, false);

velox/benchmarks/basic/LikeBenchmark.cpp Show resolved Hide resolved
return os.str();
}

// An iterator to iterate through a pattern string. It will handle escaping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "It will handle" -> "Handles escaping transparently"

or

// Iterates through a pattern string. Transparently handles escape sequences.

escapeChar_(escapeChar),
lastIndex_{pattern_.size() - 1} {}

// Advance the cursor to next char, escape char is automatically handled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: perhaps, clarify that this method returns 'false' if at end.

return pattern_.data()[currentIndex_];
}

const StringView pattern_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we'll have to take another pass, maybe in a follow-up PR, to replace StringView with std::string_view. StringView is really meant to store strings in vectors and is not a replacement for light-weight std::string_view. For example, for short strings we end up copying strings when passing StringView by value and for these strings it is not safe to store a pointer to data().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I will submit a dedicated a PR for StringView change.

const size_t lastIndex_;

int32_t currentIndex_ = -1;
CharKind charKind_ = CharKind::kNormal;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use {} consistently

@xumingming
Copy link
Contributor Author

@xumingming James, have you been able to resolve the below concern?

I’am not sure if there is something wrong with the design of the benchmark, the improvement seems huge.

@mbasmanova After some experiments, I think there are two reasons for the huge performance improvement:

  • Re2 is really slow compare to the optimizations we made, even if the input string is short(10 byte), Re2 is 100x slower than our optimizations.
  • When the input strings get longer(10bytes -> 1000bytes), the performance of our optimizations does not change much, but Re2's performance will be 10x slower.

And we can confirm the speedup is reasonable from the comparison between our optmizations and the simple scalar function strpos, starts_with, ends_with(as you suggested, really good idea :)), the performance nums are quite close(see the PR description for more details).

@mbasmanova
Copy link
Contributor

I think there are two reasons for the huge performance improvement:

@xumingming Thank you for digging in and sharing your findings. These make sense. Let's add this context to the PR description to make it easily available for future readers. Awesome job!

.addExpression("like_suffix", R"(like(col2, '%a\_b\_c', '\'))")
.addExpression("like_generic", R"(like(col0, '%a%b%c'))")
.addExpression("strpos", R"(strpos(col0, 'a_b_c'))")
.addExpression("starts_with", R"(starts_with(col1, 'a_b_c'))")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how does this work? Don't we need to split these benchmarks into 3 different sets?

// This class represents a set of expressions to benchmark those expressions
// uses the same input vector type, and are expected to have the same result
// if testing is not disabled for the set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CastBenchmark.cpp has similar usage? is the comment you refered to is out of date? (Have not look into the implementation details of ExpressionBenchmarkSet yet.)

  benchmarkBuilder
      .addBenchmarkSet(
          "cast",
          vectorMaker.rowVector(
              {"valid",
               "empty",
               "nan",
               "decimal",
               "short_decimal",
               "long_decimal",
               "timestamp"},
              {validInput,
               invalidInput,
               nanInput,
               decimalInput,
               shortDecimalInput,
               longDecimalInput,
               timestampInput}))
      .addExpression("try_cast_invalid_empty_input", "try_cast (empty as int) ")
      .addExpression(
          "tryexpr_cast_invalid_empty_input", "try (cast (empty as int))")
      .addExpression("try_cast_invalid_nan", "try_cast (nan as int)")
      .addExpression("tryexpr_cast_invalid_nan", "try (cast (nan as int))")
      .addExpression("try_cast_valid", "try_cast (valid as int)")
      .addExpression("tryexpr_cast_valid", "try (cast (valid as int))")
      .addExpression("cast_valid", "cast(valid as int)")
      .addExpression(
          "cast_decimal_to_inline_string", "cast (decimal as varchar)")
      .addExpression("cast_short_decimal", "cast (short_decimal as varchar)")
      .addExpression("cast_long_decimal", "cast (long_decimal as varchar)")
      .addExpression("cast_timestamp", "cast (timestamp as varchar)")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CastBenchmark has .disableTesting();

CC: @laithsakka

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova After digging into the details of ExpressionBenchmarkBuilder, I find my understanding of it was not quite right, updated the benchmark.

@xumingming
Copy link
Contributor Author

I think there are two reasons for the huge performance improvement:

@xumingming Thank you for digging in and sharing your findings. These make sense. Let's add this context to the PR description to make it easily available for future readers. Awesome job!

Updated the PR description.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Copy link
Contributor

@xumingming Would you rebase this PR so I can try to merge?

Currently we have optimization for LIKE only when user does not
specify escape char, this commit provides optimization(kPrefix,
kSuffix, kFixed, kSubstring) when user specified escape char.
@xumingming xumingming force-pushed the optimize_like_when_user_specified_escape_char branch from 293c487 to 0d17015 Compare December 13, 2023 00:53
@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

// Strings in even rows contain/start with/end with a_b_c depends on
// value of padAtHead && padAtTail.
if (row % 2 == 0) {
auto padding = std::string("x", row / 2 + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be std::string (n, 'x')

I'll fix. Just FYI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova Sorry for this issue, I re-runed the benchmark and update the benchmark results in the PR description, now the speedup is about 10x to 20x, more resonable :)

return std::string("a_b_c");
}
} else {
return std::string("x", row);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@mbasmanova
Copy link
Contributor

FYI, I get the following results on the benchmark

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
like_generic##like_generic                                   1.12s   892.13m
like_prefix##like_prefix                                   14.21ms     70.38
like_prefix##starts_with                                   13.55ms     73.79
like_substring##like_substring                             28.24ms     35.41
like_substring##strpos                                     23.22ms     43.07
like_suffix##like_suffix                                   12.43ms     80.43
like_suffix##ends_with                                     13.13ms     76.19

size_t end,
std::optional<char> escapeChar) {
if (!escapeChar) {
return std::string(pattern.data(), start, end - start);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be std::string(pattern.data() + start, end - start); Otherwise, lots of ASAN failures.

=================================================================
==293875==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x611000e5b4c0 at pc 0x000000319269 bp 0x7ffff14e7e60 sp 0x7ffff14e7620
READ of size 129 at 0x611000e5b4c0 thread T0
SCARINESS: 26 (multi-byte-read-heap-buffer-overflow)
    #0 0x319268 in __interceptor_strlen.part.0 ubsan.c
    #1 0x7f069dbf009d in std::char_traits<char>::length(char const*) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/char_traits.h:371
    #2 0x7f069dbeff08 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::basic_string<std::allocator<char>>(char const*, std::allocator<char> const&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/basic_string.h:536
    #3 0x7f069ddac65c in facebook::velox::functions::unescape[abi:cxx11](facebook::velox::StringView, unsigned long, unsigned long, std::optional<char>) fbcode/velox/functions/lib/Re2Functions.cpp:979
    #4 0x7f069ddaf13f in facebook::velox::functions::determinePatternKind(facebook::velox::StringView, std::optional<char>) fbcode/velox/functions/lib/Re2Functions.cpp:1203
    #5 0x7f069ddbca05 in facebook::velox::functions::makeLike(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, std::vector<facebook::velox::exec::VectorFunctionArg, std::allocator<facebook::velox::exec::VectorFunctionArg>> const&, facebook::velox::core::QueryConfig const&) fbcode/velox/functions/lib/Re2Functions.cpp:1278

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this warning a false positive or the original expression is indeed wrong?

@facebook-github-bot
Copy link
Contributor

@mbasmanova merged this pull request in 1779e82.

Copy link

Conbench analyzed the 1 benchmark run on commit 1779e82d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

facebook-github-bot pushed a commit that referenced this pull request Dec 14, 2023
…#8040)

Summary:
Expression:

```cpp
like(R"(abcdef\abcdef)", R"(%\\abc%)", '\\')
```

Error:

```
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Escape character must be followed by '%', '_' or the escape character itself
Retriable: False
Expression: current == escapeChar || current == '_' || current == '%'
Function: unescape
File: ../../velox/functions/lib/Re2Functions.cpp
```

The bug is a regression caused by #7730

Pull Request resolved: #8040

Reviewed By: Yuhta

Differential Revision: D52160961

Pulled By: mbasmanova

fbshipit-source-id: 9a777222ee0dc3be9ac0323ee5ca47ab95016f63
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants